What is Power-lifting?
Powerlifting is the ultimate strength competition. The International Powerlifting Federation is the head of nearly 100 country federations world-wide. The powerlifts:- squat, bench press and deadlift are increasingly being recognized as principal exercises in the development of an individual’s true strength. But, analyzing the factors affecting the power lifts forms the core of this project.
Problem Statement:
Analyze the various factors affecting the ability of the person to do power-lifting. This project is all about understanding this exciting and intense sport better.
Key questions that will be answered in this project are -
Solution Overview/ Approach:
We will be using a subset of the International Powerlifting Data in this project. Once we are done with the tedious but necessary task of data cleaning, we plan to perform univariate/ multi-variate analysis of the various attributes affecting the performance of the individual.We will also look at scatter plots and histograms broken down by multiple filters to be able to visually comprehend the influence of different attributes on the performance. We will also perform hypothesis testing to see if few of the relationships are statistically significant.
Finally, the relation will be explained with the help of a regression model which will give us a mathematical equation of the relationship between independent variables such as age, gender, body weight with the response variable - “ability to lift” across different divisions and world wide meets.
In short - this report aims to equip the consumer with the ability to guesstimate and be able to explain the lifting capacity of an individual, when provided with certain basic data points such as gender, age, bodyweight etc.
The packages we will be using for this project are -
library(tidyverse) # To clean data, filter, subset, plot graphs (ggplot2), organize data in nicer data formats using tibble
library(DT) # To display the data in a clean table format with options to select/ de-select columns
library(knitr) # To display the data in a clean table format, helps with commands such as head
library(plotly) # For interacticve plots
library(lattice) # To plot multiple clean looking graphs
library(broom) # To get the regression output in a neat table format
library(car) # To calculate variance inflation factor (VIF()) while fitting a multiple regression model
library(stringr) # To handle character strings better
library(lubridate) # For handling date and time columns
library(dvmisc) # To calculate RMSE (Root Mean Square Error)
library(MASS) # To use boxcox transformation on the response variableAll the packages required to run the document have been loaded.
This section talks about the various steps involved from importing the data set to the cleaning of data set. Each step of this process has been explained and the corresponding code has been shown. An option has been provided to hide the code if the reader wishes to skip reading the code. All the tables have been provided with a HTML scrollable format of the data. This will help us filter and check only the columns of interest using the option “Column Visibility” in addition to the “Search bar” given on top of the table.
At the end of this step, the data set will be ready to use for the analysis.
The data set that is being used in this project can be found on the github page: IPF Cleaned Dataset (Note - This link works only when opened in a new window)
The original source of the data set which is a much larger dataset can be found at: Open Powerlifting
This data set aims to create an accessible, accurate and open archive of the world’s powerlifting data. The original dataset is being updated every month, but the data set we are using for this project is last collected on September 20th 2019 and consists of data until August 26th 2019.
The dataset we are using in this project contains 41,152 observations with 16 variables having information on player’s gender, age, weight, lifting capacity, event participated and the date of the meet. We notice that the missing values have been recorded as "NA". Some data entries to call out are negative values in the weight lifted variables which indicate failed attempts. The dataset also contains information on people who were disqualified and guest lifters i.e. people who succeeded but aren’t eligible for awards.
Our preliminary observation suggests there might be few data entry errors, outlier values and irrelavant columns for the analysis, which have been dealt with in the subsequent data cleaning steps.
First, let us start with the important step of importing our dataset -
## 'data.frame': 41152 obs. of 16 variables:
## $ name : chr "Hiroyuki Isagawa" "David Mannering" "Eddy Pengelly" "Nanda Talambanua" ...
## $ sex : chr "M" "M" "M" "M" ...
## $ event : chr "SBD" "SBD" "SBD" "SBD" ...
## $ equipment : chr "Single-ply" "Single-ply" "Single-ply" "Single-ply" ...
## $ age : num NA 24 35.5 19.5 NA NA 32.5 31.5 NA NA ...
## $ age_class : chr NA "24-34" "35-39" "20-23" ...
## $ division : chr NA NA NA NA ...
## $ bodyweight_kg : num 67.5 67.5 67.5 67.5 67.5 67.5 67.5 90 90 90 ...
## $ weight_class_kg : chr "67.5" "67.5" "67.5" "67.5" ...
## $ best3squat_kg : num 205 225 245 195 240 ...
## $ best3bench_kg : num 140 132 158 110 140 ...
## $ best3deadlift_kg: num 225 235 270 240 215 230 235 335 310 295 ...
## $ place : chr "1" "2" "3" "4" ...
## $ date : chr "1985-08-03" "1985-08-03" "1985-08-03" "1985-08-03" ...
## $ federation : chr "IPF" "IPF" "IPF" "IPF" ...
## $ meet_name : chr "World Games" "World Games" "World Games" "World Games" ...
From the structure, we notice that the dataset has 41,152 rows and 16 columns - the dataset has been imported correctly. We see that the column names are appropriate and have suitable datatype assigned except for the column “date”, which is in character format.
We change the class of Date variable to date format
Checking the structure again to verify if the command worked and new NA’s are not introduced.
## 'data.frame': 41152 obs. of 16 variables:
## $ name : chr "Hiroyuki Isagawa" "David Mannering" "Eddy Pengelly" "Nanda Talambanua" ...
## $ sex : chr "M" "M" "M" "M" ...
## $ event : chr "SBD" "SBD" "SBD" "SBD" ...
## $ equipment : chr "Single-ply" "Single-ply" "Single-ply" "Single-ply" ...
## $ age : num NA 24 35.5 19.5 NA NA 32.5 31.5 NA NA ...
## $ age_class : chr NA "24-34" "35-39" "20-23" ...
## $ division : chr NA NA NA NA ...
## $ bodyweight_kg : num 67.5 67.5 67.5 67.5 67.5 67.5 67.5 90 90 90 ...
## $ weight_class_kg : chr "67.5" "67.5" "67.5" "67.5" ...
## $ best3squat_kg : num 205 225 245 195 240 ...
## $ best3bench_kg : num 140 132 158 110 140 ...
## $ best3deadlift_kg: num 225 235 270 240 215 230 235 335 310 295 ...
## $ place : chr "1" "2" "3" "4" ...
## $ date : Date, format: "1985-08-03" "1985-08-03" ...
## $ federation : chr "IPF" "IPF" "IPF" "IPF" ...
## $ meet_name : chr "World Games" "World Games" "World Games" "World Games" ...
## name sex event equipment
## 0 0 0 0
## age age_class division bodyweight_kg
## 2906 2884 627 187
## weight_class_kg best3squat_kg best3bench_kg best3deadlift_kg
## 1 13698 2462 14028
## place date federation meet_name
## 0 0 0 0
Let us see if the data is unique at this level -
As there is no change in the number of observations, the data is unique at this level.
Let us also rename the column ‘place’ to ‘position’ to make it more intuitive. Also we are getting rid of the suffix “kg” for few columns-
## [1] "name" "sex" "event" "equipment"
## [5] "age" "age_class" "division" "bodyweight_kg"
## [9] "weight_class_kg" "best3squat_kg" "best3bench_kg" "best3deadlift_kg"
## [13] "place" "date" "federation" "meet_name"
colnames(ipf)[13] <- "position"
colnames(ipf)[8] <- "bodyweight"
colnames(ipf)[9] <- "weight_class"
colnames(ipf)[10] <- "best3squat"
colnames(ipf)[11] <- "best3bench"
colnames(ipf)[12] <- "best3deadlift"
colnames(ipf)[14] <- "meet_date"
names(ipf)## [1] "name" "sex" "event" "equipment"
## [5] "age" "age_class" "division" "bodyweight"
## [9] "weight_class" "best3squat" "best3bench" "best3deadlift"
## [13] "position" "meet_date" "federation" "meet_name"
Now, we have fixed the datatype and column names - Let’s move ahead!
We will remove the column ‘federation’ as it has only one unique value and would not add any value to the analysis
## [1] 41152 15
Among the columns, age and age_class, we might only use either of them for our analysis. Also, as age groups are randomly distributed in this column, we might not use this column and might just stick to using the age column but let us not remove any of the columns now, which might turn out useful for the analysis later.
One possible way the age_class column would be helpful is to look at the distributions across multiple classes as this has only 16 unique values, we will retain this column for now.
## [1] NA "24-34" "35-39" "20-23" "40-44" "45-49" "16-17" "18-19"
## [9] "50-54" "13-15" "55-59" "60-64" "65-69" "70-74" "75-79" "80-999"
## [17] "5-12"
We see that one particluar class is wrongly named - so renaming age class of 80-999 to 80-99, but however as all age groups are randomly distributed, we might not use this column and might stick to just using the age column.
ipf <- mutate(ipf,age_class = ifelse(age_class == "80-999","80-99",age_class))
unique(ipf$age_class)## [1] NA "24-34" "35-39" "20-23" "40-44" "45-49" "16-17" "18-19" "50-54"
## [10] "13-15" "55-59" "60-64" "65-69" "70-74" "75-79" "80-99" "5-12"
Let us check for any similar outlier in the weight_class column:
## [1] "67.5" "90" "90+" "52" "56" "60" "75" "82.5" "100"
## [10] "110" "125" "125+" "44" "48" "67.5+" "100+" "57" "63"
## [19] "72" "84" "84+" "74" "83" "105" "120" "120+" "66"
## [28] "105+" "72+" "53" "59" "93" "43" "47" "110+" "82.5+"
## [37] NA "40" "75+"
For Weight class column, the groups are again randomly distributed with multiple classes and this column might not be useful when we try to correlate the amount of weight lifted with the weight column. However, we will try to standardize the weight class column as it has too many classes. Let us create a new column with equal spaced buckets.
(labs <- c("1-10","10-20","20-30","30-40","40-50","50-60","60-70","70-80",
"80-90","90-100","100-110","110-120","120-130","130-140","140-150",
"150-160","160-170","170-180","180-190","190-200","200-210","210-220"
,"220-230","230-240","240+"))## [1] "1-10" "10-20" "20-30" "30-40" "40-50" "50-60" "60-70"
## [8] "70-80" "80-90" "90-100" "100-110" "110-120" "120-130" "130-140"
## [15] "140-150" "150-160" "160-170" "170-180" "180-190" "190-200" "200-210"
## [22] "210-220" "220-230" "230-240" "240+"
ipf$bodyweight_class <- cut(ipf$bodyweight, breaks = c(seq(0, 240,
by = 10), Inf),labels = labs, right = FALSE)
unique(ipf$bodyweight_class)## [1] 60-70 90-100 <NA> 50-60 40-50 70-80 80-90 100-110 120-130
## [10] 110-120 130-140 140-150 150-160 160-170 170-180 190-200 180-190 30-40
## [19] 200-210 210-220 240+
## 25 Levels: 1-10 10-20 20-30 30-40 40-50 50-60 60-70 70-80 80-90 ... 240+
Now, we will look at the summary statistics of the numeric variables. This will help us understand how the values are distributed in these columns:
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.50 22.50 31.50 34.77 45.00 93.50 2906
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 37.29 60.00 75.55 81.15 97.30 240.00 187
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -210.0 160.0 215.0 217.6 270.0 490.0 13698
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -160.0 97.5 140.0 144.7 185.0 415.0 2462
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -215.0 170.0 222.5 221.8 270.0 420.0 14028
Let us also check for missing values before we determine what our next steps should be -
## name sex event equipment
## 0 0 0 0
## age age_class division bodyweight
## 2906 2884 627 187
## weight_class best3squat best3bench best3deadlift
## 1 13698 2462 14028
## position meet_date meet_name bodyweight_class
## 0 0 0 187
Having looked at the numerical summaries, we see that the data does have both extreme values and missing values. We will have to clean this dataset further, so let us deal with each of these variables separately in the next step of the data cleaning process.
We observe that there are missing values in multiple columns, we decide to clean the data by selecting each of the column and check for outliers and missing values.
For Columns - weight_class and bodyweight
As Weight Class column has only one missing value, we remove it as it doesn’t add any value.
## [1] "67.5" "90" "90+" "52" "56" "60" "75" "82.5" "100"
## [10] "110" "125" "125+" "44" "48" "67.5+" "100+" "57" "63"
## [19] "72" "84" "84+" "74" "83" "105" "120" "120+" "66"
## [28] "105+" "72+" "53" "59" "93" "43" "47" "110+" "82.5+"
## [37] "40" "75+"
Missing value for the Weight Class column has been removed.
We can impute for missing values of bodyweight from the column weight class but again, this wouldn’t make much sense as the weight class is a range and the tradeoff is making data available for just 187 rows (~0.5% of data) that too with error after a lot data computation. So, we just go ahead and remove these rows.
## name sex event equipment
## 0 0 0 0
## age age_class division bodyweight
## 2830 2808 613 0
## weight_class best3squat best3bench best3deadlift
## 0 13663 2452 13990
## position meet_date meet_name bodyweight_class
## 0 0 0 0
We also see that the missing values for Body Weight column have been removed. Also, as we do not get any additional information from the old column - “weight_class”, So we just get rid of this column and see if the dataset is unique at this level.
## [1] "name" "sex" "event" "equipment"
## [5] "age" "age_class" "division" "bodyweight"
## [9] "weight_class" "best3squat" "best3bench" "best3deadlift"
## [13] "position" "meet_date" "meet_name" "bodyweight_class"
Let us check for outliers in the column bodyweight now,
boxplot(ipf$bodyweight, main="Boxplot for Bodyweight", col='powderblue',
whiskcol="#4900FFFF",
medcol="#0092FFFF",
staplecol="#0024FFFF",
outcol="#FF000099")As We observe that there are lot of outliers in the higher quartile,let us follow the standard 1.5(IQR) approach to deal with the outliers
(outlier_bodyweight <- quantile(ipf$bodyweight, c(0.75), na.rm = TRUE) +
(1.5 * IQR(ipf$bodyweight, na.rm = TRUE)))
We check for outliers in the body weight column above 153.25 to see the number of rows that have the value of 153.25 and above
We observe that there are 410 observations. We decide to go ahead and remove these observations
For Columns - age and age class
We will check if we can derive the missing values from either of the 2 columns age and age class.
## name sex event equipment
## 0 0 0 0
## age age_class division bodyweight
## 2823 2801 610 0
## best3squat best3bench best3deadlift position
## 13476 2405 13802 0
## meet_date meet_name bodyweight_class
## 0 0 0
We observe that there are 2823 and 2801 missing values in the columns age and age class.
## [1] 0.5
## age age_class
## 1 0.5 <NA>
From the row with value equal to 0.5, we cannot infer anything meaningful i.e., we will not be able to impute the missing values from this column and this could a data entry error.
Now, as we cannot infer the value of age from this, it doesn’t make sense to impute these values. So we decide leave these rows as it is and We will not remove these rows as we would lose ~7% of the data if we do so, also we can use the information from other columns for these rows.
## [1] "numeric"
Now, let us look at the outliers for the column age.
boxplot(ipf$age,main="Boxplot for Age",col='powderblue',
whiskcol="#4900FFFF",
medcol="#0092FFFF",
staplecol="#0024FFFF",
outcol="#FF000099")We notice that there are lot of outliers in the extremities of higher scale.
## 75%
## 80
We check for outliers above the age of 80 in the age column and let us see how many rows have a value of 80 and above.
We decide to remove these rows where the age is greater than 80 - as we only have 41 observations. Also we get rid of the observation, where age is 0.5, which seems to be a clear anomaly. Also, we will get rid of the row in which the age class is mentioned as “5-12” as it has only one record.
ipf <- ipf %>%
filter((age <= 80) %>% replace_na(TRUE))
ipf <- ipf %>%
filter(age != 0.5 | is.na(age))
ipf <- ipf %>%
filter(age_class != "5-12" | is.na(age_class))
colSums(is.na(ipf))## name sex event equipment
## 0 0 0 0
## age age_class division bodyweight
## 2823 2800 610 0
## best3squat best3bench best3deadlift position
## 13445 2404 13773 0
## meet_date meet_name bodyweight_class
## 0 0 0
For the division column, there are 610 missing values and we can impute them if the distribution of age within classes is consistent. This doesn’t seem to be the case here and hence these values cannot be imputed.
However, We will not remove the missing values now as we are not sure if this variable will be used in the analysis. Also, considering the number of NA (~1.5% of the total data), we decided to leave them as it is. Another major reason is that these values belong to the time period between 1985-1994, which could be useful information from other columns while doing the analysis.
Also, let us look at the unique values and count for each of the type of event.
##
## B SB SBD
## 12343 2 28166
As we have only 2 observations for SB, we can and hence can remove these observations. The rationale for removing is that 2 is a very small number to understand pattern or draw insights from.
Now, we are left with 3 numeric columns - best3squat, best3bench and best3deadlift.
## name sex event equipment
## 0 0 0 0
## age age_class division bodyweight
## 2823 2800 610 0
## best3squat best3bench best3deadlift position
## 13445 2404 13771 0
## meet_date meet_name bodyweight_class
## 0 0 0
Although we have approximately 13,500 and 13,800 missing values for squat and deadlift, we are not really concerned about them as a majority of them fall under a particuar event “B”, where they are not supposed to have these values.This is shown below
## name sex event equipment
## 0 0 0 0
## age age_class division bodyweight
## 2519 2498 327 0
## best3squat best3bench best3deadlift position
## 1102 1349 1428 0
## meet_date meet_name bodyweight_class
## 0 0 0
Now, We observe that the resulting dataset with the event type as “SBD” has approximately 1100 to 1500 missing values as compared to the previously seen ~13k missing values. So, we subset the dataset into 2 datasets, one for Bench and one for SBD
ipf_Bench <- ipf %>%
filter(event == 'B' %>% replace_na(TRUE))
ipf_SBD <- ipf %>%
filter(event == 'SBD' %>% replace_na(TRUE))Now, we will have to look at the 3 variables - Bench, Deadlift and Squat for treatment.
However, only the variable bench for the dataset ipf_Bench,
boxplot(ipf_Bench$best3bench, main="Boxplot for Best3bench",col='powderblue',
whiskcol="#4900FFFF",
medcol="#0092FFFF",
staplecol="#0024FFFF",
outcol="#FF000099")(outlier_bench <- quantile(ipf_Bench$best3bench, c(0.75), na.rm = TRUE) +
(1.5 * IQR(ipf_Bench$best3bench, na.rm = TRUE)))## 75%
## 347.5
We observe that there are outliers above the best3bench of 347.5. Let us see how many rows have value above 347.5.
We decide to remove these rows where the value is above 347.5 as we have 22 observations only.
We Check for outliers for negative values and observed that we do not have any.
Now, let us proceed to look at the other dataset where we have observations for all the 3 variables - Squat, Deadlift and Bench i.e. we will look at the dataset ipf_SBD.
For bench in event SBD,
boxplot(ipf_SBD$best3bench,main="Boxplot for Best3bench",col='powderblue',
whiskcol="#4900FFFF",
medcol="#0092FFFF",
staplecol="#0024FFFF",
outcol="#FF000099")We observe that there are outliers at both the extremities. Although, we have a reason for the existence of negative values, we go ahead and remove them as they are supposed to be rare events with very limited number of observations for us to draw any conclusions or insights.
(outlier_bench_SBD_upper <- quantile(ipf_SBD$best3bench, c(0.75), na.rm = TRUE) +
(1.5 * IQR(ipf_SBD$best3bench, na.rm = TRUE)))## 75%
## 292.5
(outlier_bench_SBD_lower <- quantile(ipf_SBD$best3bench, c(0.25), na.rm = TRUE) -
(1.5 * IQR(ipf_SBD$best3bench, na.rm = TRUE)))## 25%
## -27.5
We see outliers above 292.5 and below -27.5, let us see check for number of rows having values above 292.5 and below -27.5.
There are 151 observations and we decide to remove these rows where the value is above 292.5 and below -27.5.
For squat in event SBD,
boxplot(ipf_SBD$best3squat,main="Boxplot for Best3squat",col='powderblue',
whiskcol="#4900FFFF",
medcol="#0092FFFF",
staplecol="#0024FFFF",
outcol="#FF000099")Again, we notice that there are outliers on both positive and negative ends.
(outlier_squat_SBD_upper <- quantile(ipf_SBD$best3squat, c(0.75), na.rm = TRUE) +
(1.5 * IQR(ipf_SBD$best3squat, na.rm = TRUE)))## 75%
## 422.5
(outlier_squat_SBlower <- quantile(ipf_SBD$best3squat, c(0.25), na.rm = TRUE) -
(1.5 * IQR(ipf_SBD$best3squat, na.rm = TRUE)))## 25%
## 2.5
Here, we observe outliers above 422.5 and below 2.5, let us check for the number of rows that have values above 422.5 and below 2.5.
There are 25 observations and we decide to remove these observations.
For deadlift in event SBD,
boxplot(ipf_SBD$best3deadlift,main="Boxplot for Best3deadlift",col='powderblue',
whiskcol="#4900FFFF",
medcol="#0092FFFF",
staplecol="#0024FFFF",
outcol="#FF000099")We notice that there are outliers only in the lower extremity. However, let us check on both extremities for outliers and also the number of rows that have value above 420 and below 20.
(outlier_deadlift_SBD_upper <- quantile(ipf_SBD$best3deadlift, c(0.75), na.rm = TRUE)
+ (1.5 * IQR(ipf_SBD$best3deadlift, na.rm = TRUE)))## 75%
## 420
(outlier_deadlift_SBD_lower <- quantile(ipf_SBD$best3deadlift, c(0.25), na.rm = TRUE)
- (1.5 * IQR(ipf_SBD$best3deadlift, na.rm = TRUE)))## 25%
## 20
ipf_outlier_deadlift_SBD <- filter(ipf_SBD,best3deadlift > 420 |
best3deadlift < 20,replace_na(TRUE))There are 3 observations and We decide to get rid of these observations.
As we evaluated the event types separately to avoid removal of values that are not outliers within a specific group, we will now combine them to form the final dataset and reorder the columns.
Performing few sanity checks on the combined dataset and displaying the head for the final dataset we will be using for the analysis
Finally, after cleaning the dataset, we are left with a dataframe of dimensions 40308, 15
Let us look at the summary statistics for the numerical variables:
age: The minimum and maximum values are 13.5 and 80 after removing ~40 outlier observations. The descriptive statistics are shown below:
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 13.50 22.50 31.50 34.78 45.50 80.00 2822
bodyweight: The minimum and maximum values are 37.29 and 94.80 after removing ~410 outlier observations. The descriptive statistics are column is shown below:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 37.29 60.00 75.00 80.10 94.80 153.24
best3squat: The minimum and maximum values are 25 and 422.5 after removing ~25 outlier observations. The descriptive statistics are column is shown below:
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 25.0 160.0 212.5 215.0 265.0 422.5 13415
best3bench: The minimum and maximum values are 25 and 347.5 after removing ~170 outlier observations. The descriptive statistics are column is shown below:
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 25.0 95.0 140.0 142.7 182.5 347.5 2398
best3deadlift: The minimum and maximum values are 25 and 420 after removing ~3 outlier observations. The descriptive statistics are column is shown below:
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 25.0 170.0 220.0 220.3 270.0 420.0 13739
From the summary statistics of the numeric columns, we observe that the distribution of the data points for the variables is not skewed as the difference between Mean and Median is minimal.
A few comments on the dataset -
We have removed outliers and treated NA’s to a major extent
For the column - “age”, we could have derived some of the NA values from the column “age_class”
Although we will not be using the “name” column in our analysis - we will retain this column in the main dataset due to the following reason
Data Dictionary:
| variable_name | data_type | description |
|---|---|---|
| name | character | Individual lifter name |
| sex | character | Binary gender (M/F) |
| event | character | The type of competition that the lifter entered - unique values are SBD(Squat Bench Deadlift) and B(Bench) |
| equipment | character | The equipment category under which the lifts were performed. Values such as Raw (Bare Knees or Sleeves), Wraps(with knee wraps) and Single-ply (single-ply suits) |
| age | double | The age of the lifter on the start date of the meet, if known. |
| age_class | character | The age class in which the filter falls, for example 40-45 |
| division | character | Text describing the division of competition, like Open or Juniors 20-23 or Professional. |
| bodyweight | double | The recorded bodyweight of the lifter at the time of competition, to two decimal places. |
| bodyweight_class | character | The weight class in which the lifter competed, to two decimal places. Max specified by number and min specified by the symbol “+” to the right |
| best3squat | double | Maximum of the first three successful attempts for the lift - squat. |
| best3bench | double | Maximum of the first three successful attempts for the lift - bench. |
| best3deadlift | double | Maximum of the first three successful attempts for the lift - deadlift. |
| position | character | The recorded place of the lifter in the given division at the end of the meet. Special Values: G - Guest Lifter, DQ - Disqualified, DD - Doping Disqualification |
| meet_date | date | The date of the meet |
| meet_name | character | The name of the meet without the year. Can be seen in conjuction with the date column |
We have structured this section into 3 parts -
Firstly, let us look at the distributions of the numeric variables: Age, Bodyweight, Best3squat, Best3bench & Best3deadlift.
Distribution for Age Variable :-
ggplot(data = ipf_main, aes(age)) +
geom_histogram(binwidth = 5, col = I("violet"), fill = I("brown")) +
ggtitle("Distribution of age") +
labs(x = "Age", y = "Frequency") +
theme(plot.title = element_text(hjust = 0.5))We notice a right skewed distribution and most of the participation instances are by people of age ~20.
Distribution for Bodyweight Variable :-
ggplot(data = ipf_main, aes(bodyweight)) +
geom_histogram(binwidth = 5, col = I("violet"), fill = I("brown")) +
ggtitle("Distribution of Bodyweight") +
labs(x = "Bodyweight", y = "Frequency") +
theme(plot.title = element_text(hjust = 0.5))A majority of participants have their weights between 40 and 100.
Distribution for Best3squat Variable :-
ggplot(data = ipf_main, aes(best3squat)) +
geom_histogram(binwidth = 5, col = I("violet"), fill = I("brown")) +
ggtitle("Distribution of Best3squat") +
labs(x = "Best3squat", y = "Frequency") +
theme(plot.title = element_text(hjust = 0.5))Most of them lift between 100kgs and 300 kgs in the squat category and the distribution closely resembles that of a normal distribution.
Distribution for Best3bench Variable.
ggplot(data = ipf_main, aes(best3bench)) +
geom_histogram(binwidth = 5, col = I("violet"), fill = I("brown")) +
ggtitle("Distribution of Best3bench") +
labs(x = "Best3bench", y = "Frequency") +
theme(plot.title = element_text(hjust = 0.5))Most of them lift between 70kgs and 200 kgs in the bench press category and the distribution is slightly skewed towards the right.
Distribution for Best3deadlift Variable.
ggplot(data = ipf_main, aes(best3deadlift)) +
geom_histogram(binwidth = 5, col = I("violet"), fill = I("brown")) +
ggtitle("Distribution of Best3deadlift") +
labs(x = "Best3deadlift", y = "Frequency") +
theme(plot.title = element_text(hjust = 0.5))Most of them lift between 120kgs and 320 kgs in the deadlift category and the distribution closely resembles that of a normal distribution.
Character Variables:-
ggplot(data = ipf_main, aes(x = sex)) +
geom_bar() +
ggtitle("Male vs Female participants") +
labs(x = "Sex",y = "Number of Observations") +
theme(plot.title = element_text(hjust = 0.5))ggplot(data = ipf_main, aes(x = event)) +
geom_bar() +
ggtitle("Split between events") +
labs(x = "Event",y = "Number of Observations") +
theme(plot.title = element_text(hjust = 0.5))ggplot(data = ipf_main, aes(x = equipment)) +
geom_bar() +
ggtitle("Split between equipments") +
labs(x = "Equipment",y = "Number of Observations") +
theme(plot.title = element_text(hjust = 0.5))A quick summary of the few important character variables tells us that we almost have double the number of observations for males as compared to females. This is the same case with equipment where SBD has ~2X number of observations as comapred to the event B.
We also notice that the equipment “single ply” has the most available information while Wraps has too less information for analysis.
We will center our EDA on the distribution of variables such as age, bodyweight, equipment and how they impact the weight lifted by individuals.
Let us analyze how the factors like Age and Sex are affecting the weight lifting capacity.
For this, we decided to subset the dataset by event “SBD” and create a new variable for the total weight lifted in the three events i.e., Bench, Squat and Deadlift after replacing the missing values with zero.
ipf_main_SBD <- ipf_main %>%
filter(event == "SBD")
ipf_main_SBD$best3squat <- ipf_main_SBD$best3squat %>%
replace_na(0)
ipf_main_SBD$best3bench <- ipf_main_SBD$best3bench %>%
replace_na(0)
ipf_main_SBD$best3deadlift <- ipf_main_SBD$best3deadlift %>%
replace_na(0)
ipf_main_SBD <- ipf_main_SBD %>%
mutate(total_weight <- best3squat + best3deadlift + best3bench)
colnames(ipf_main_SBD)[16] <- "totalweight"
ipf_main_SBD <- filter(ipf_main_SBD,totalweight > 0)Let us now plot a scatterplot to look at the impact of age on total weight lifted.
ggplot(data=ipf_main_SBD, aes(age,totalweight)) +
geom_point() +
geom_smooth(method = "auto") +
ggtitle("Impact of age on total weight lifted") +
labs(x = "Age", y = "Total Weight Lifted") +
theme(plot.title = element_text(hjust = 0.5))From the above Scatterplot, we observe that the weight lifting capacity is higher among the age groups between 20 to 40 and then, we can observe a downward trend from there on. Though there’s downward trend with the increase in age, it’s interesting to see that people lying in the age group 60-80 are still able to lift weights of ~600 kgs too.
Let us now do a scatterplot to look at the impact of age on different types of lift - best3squat, best3bench & best3deadlift.
ipf_main %>%
filter(event == "SBD") %>%
dplyr::select(name, age, best3squat, best3bench, best3deadlift) %>%
group_by(name, age) %>%
gather(Type, weights, 3:5) %>%
ggplot(aes(x = age, y = weights)) +
geom_point(alpha = 0.12) +
geom_smooth(method = 'auto') +
facet_wrap(~ Type, nrow = 1) +
labs(x = "Age", y = "Weight Lifted") +
ggtitle("Overall distribution of age with weight lifted for different types of lifts") +
theme(plot.title = element_text(hjust = 0.5))The decline in the lifting capacity is almost same across all 3 types of lifts. However, squat shows a sharper decline in the lifting capacity after the age of 40 as compared to the gradual decline in the other 2 types of lifts.
Now, let us use a scatterplot to look at the variance in weight lifting capacities between male and female powerlifters
ggplot(data = ipf_main_SBD, aes(age,totalweight)) +
geom_point(alpha = 0.2) +
facet_wrap(~sex) +
geom_smooth(method = "auto") +
ggtitle("Impact of Age among male and female powerlifters on total weight lifted") +
labs(x = "Age", y = "Total Weight") +
theme(plot.title = element_text(hjust = 0.5))We can observe that the weight lifting capacities are higher in male powerlifters than female powerlifters. Especially, when we compare male and female powerlifters for age group between 20 and 40, the total weight lifted is around 1000 kgs for males and 700 kgs for females.
One interesting thing is that we see a downward trend in weight lifting capacities for females starting at the age of 30 yrs and for males it starts slightly later at the age of 35 yrs. The powerlifters from both the genders are able to lift reasonably higher weights even after an age of 60 yrs.
Now, let us understand how the lifting capacity varies by different lifts.
ipf_main %>%
filter(event == "SBD") %>%
dplyr::select(name, age, sex, c(10:12)) %>%
group_by(name, age, sex) %>%
gather(Type, weights, 4:6) %>%
ggplot(aes(x = age, y = weights)) +
geom_point(alpha = 0.12) +
geom_smooth(method = 'auto') +
facet_grid(sex ~ Type) +
labs(x = "Age", y = "Total Weight") +
theme(plot.title = element_text(hjust = 0.5)) +
ggtitle("Distribution of age by gender for different types of weight lifting")The weight lifted in squat seems to be the most impacted by gender in terms of the weight lifted, although there is a very big impact of gender on the weight lifted overall. Also the weight lifted starts to decline with age faster for females as compared to males by a small margin, this is more evident in the cases of deadlift and squat.
Let us observe if age is related to the equipment usage in powerlifting. For this, let us look at the distribution of age classes for each of the equipments being used.
ggplot(data = ipf_main, aes(x = equipment, fill = factor(age_class))) +
geom_bar(position = "fill") +
ggtitle("Distribution of age classes for different equipment") +
labs(x = "Equipment", y = "Cummulative Percentage of Usage") +
theme(plot.title = element_text(hjust = 0.5)) +
guides(fill=guide_legend(title = "Age Class"))From the above bar plot, we see that the distribution of age classes remains almost constant across the equipment types - Raw and Single-ply.
An intersting thing is that people between ages 16 to 19 prefer to perform lifts with bare knees or knee sleeves.
There is no indication that the usage of equipment is being impacted by the persons age as the distribution of age classes appears to be constant across different equipment types. We do not have enough information to comment on the distribution of age classes using wraps.
Let us now look at how using bodyweight is impacting the total weight lifted using different equipments.
ggplot(data = ipf_main_SBD, aes(bodyweight,totalweight)) +
geom_point() +
facet_wrap(~equipment) +
geom_smooth(method = "auto") +
ggtitle("Impact of equipment usage and bodyweight on Total Weight Lifted") +
labs(x = "Bodyweight", y = "totalweight") +
theme(plot.title = element_text(hjust = 0.5))From the above plots, we can clearly see that there is an increase in the total weight lifted using single-ply equipment when compared to raw weight lifting. This suggests a strong correlation between total weight lifting capacity and equipment usage.
Another intersting observation is that the increase in total weight lifted is almost negligible when body weight increases beyond 90kgs. Further, We will try to statistically prove whether there is strong relation between total weight lifted and equipment usage through hypothesis testing.
Let us create two subsets, one for single-ply equipment weight lifting and the other for raw weight lifting. Let us perform Welch Two Sample test to prove that the difference between the means for two samples is greater than zero.
ipf_singleply <- ipf_main_SBD %>%
group_by(equipment) %>%
filter(equipment == "Single-ply")
ipf_raw <- ipf_main_SBD %>%
group_by(equipment) %>%
filter(equipment == "Raw")
t.test(ipf_singleply$totalweight, ipf_raw$totalweight, alternative = "greater")##
## Welch Two Sample t-test
##
## data: ipf_singleply$totalweight and ipf_raw$totalweight
## t = 27.871, df = 8997.4, p-value < 2.2e-16
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
## 69.24364 Inf
## sample estimates:
## mean of x mean of y
## 572.6758 499.0888
From the results of the tests, we notice that alternative hypothesis i.e., the difference between the means for two samples, using single-ply and raw weight lifting, is greater than zero, is true. We can conclude that the average total weight lifted is higher using an equipment like single-ply when compared to raw weight lifting.
Let us now plot a barplot to observe the distribution of usage of equipment among male and female weight lifters.
ipf_4 <- ipf_main %>%
group_by(sex, equipment) %>%
summarise(count=n())
ggplot(data = ipf_main, aes(x = sex,fill = factor(equipment))) +
geom_bar(position = "fill") +
ggtitle("Distribution of Equipment usage among male and female weight lifters") +
labs(x = "Sex", y = "Cummulative Percentage of Usage") +
theme(plot.title = element_text(hjust = 0.5)) +
guides(fill=guide_legend(title = "Equipment"))From the above barplot, we observe that the lifters of both the genders prefer to use single-ply equipment more than raw. We can notice that there is only a slight difference in percentage of usage of single-ply equipment between males and females.
Further, Let us also look at the impact of equipment usage on weight lifting capacity stands true for both the genders. We will perform hypothesis tests to prove that the average total weight lifted using single-ply equipment is greater than raw weight lifting for both male and female lifters.
ipf_singleply_male <- ipf_singleply %>%
group_by(sex) %>%
filter(sex == "M")
ipf_raw_male <- ipf_raw %>%
group_by(sex) %>%
filter(sex == "M")
t.test(ipf_singleply_male$totalweight, ipf_raw_male$totalweight, alternative = "greater")##
## Welch Two Sample t-test
##
## data: ipf_singleply_male$totalweight and ipf_raw_male$totalweight
## t = 18.182, df = 5358.7, p-value < 2.2e-16
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
## 45.3992 Inf
## sample estimates:
## mean of x mean of y
## 653.4089 603.4932
From the above results, we see that the usage of single ply equipment by males increases the amount of weight lifted by them, which is in line with the overall population.
ipf_singleply_female <- ipf_singleply %>%
group_by(sex) %>%
filter(sex == "F")
ipf_raw_female <- ipf_raw %>%
group_by(sex) %>%
filter(sex == "F")
t.test(ipf_singleply_female$totalweight, ipf_raw_female$totalweight, alternative = "greater")##
## Welch Two Sample t-test
##
## data: ipf_singleply_female$totalweight and ipf_raw_female$totalweight
## t = 24.825, df = 4931.1, p-value < 2.2e-16
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
## 48.42398 Inf
## sample estimates:
## mean of x mean of y
## 401.5128 349.6519
We see similar behavior among females as well. There are no surprises here.
Let us try to profile which section of population tend to get disqualified from various meets. We will start off by looking at the gender -
ipf_main %>%
filter(position == "DQ") %>%
group_by(sex) %>%
ggplot(aes(sex), fill = factor(name)) +
geom_bar(position = "dodge") +
ggtitle("Distribution of disqualified people by sex") +
labs(x = "Sex", y = "Count of Disqualification Instances") +
theme(plot.title = element_text(hjust = 0.5))We see that instances of males being disqualified are 4 times more likely as compared to females. Let us see if the number of males being disqualified is being driven by a certain age group.
ipf_main %>%
filter(position == "DQ", sex == "M") %>%
group_by(age_class) %>%
ggplot(aes(age_class), fill = factor(name)) +
geom_bar(position = "dodge") +
ggtitle("Distribution of disqualified males by age class") +
labs(x = "Age Class", y = "Count of Disqualification Instances") +
theme(plot.title = element_text(hjust = 0.5))A major chunk of disqualification instances occur in the age group 24 - 34 followed by the age group 20 - 23.
Now, let us see if they happen in the same meet or distributed between different meets.
ipf_main %>%
filter(position == "DQ") %>%
group_by(meet_name) %>%
ggplot(aes(meet_name), fill = factor(name)) +
geom_bar(position = "dodge") +
coord_flip() +
ggtitle("Distribution of disqualified people across meets") +
labs(x = "Meet Name", y = "Count of Disqualification Instances") +
theme(plot.title = element_text(hjust = 0.9))A lot of disqualifications occur in “World bench press championship” followed by worlds masters powerlifting championship and Men’s world powerlifting championship.
Now let us see if the bodyweight or age of the person impacts the final position of that person in the meet. We will look at the correlation values to check this.
ipf_main_subset1 <- ipf_main %>%
filter(position != "DD" , position != "DQ" , position != "G") %>%
mutate(position_new <- as.numeric(position)) %>%
rename(position_new = "position_new <- as.numeric(position)")
cor(ipf_main_subset1$bodyweight,
ipf_main_subset1$position_new, use = "complete.obs")## [1] 0.08960862
## [1] -0.1488544
We only consider the numeric values of the position variable to check if there is any correlation between the position and bodyweight of the person. The correlation values are very low which suggests that there is no direct impact of bodyweight on the position.
Let us now observe the distribution of weight lifters by sex across various meets.
ipf_main %>%
ggplot(aes(meet_name, fill = factor(sex))) +
geom_bar(position = "fill") +
ggtitle("Distribution of weight lifters by sex across meets") +
labs(x = "Meet Name", y = "Gender Split Percentage") +
theme(plot.title = element_text(hjust = 0.8)) +
guides(fill=guide_legend(title = "Sex")) +
coord_flip() Males participation is significantly greater than females participation in meets such as “world disabled bench press championship” and “world masters powerlifting championship”. One interesting thing is that there is higher male participation in all the bench press championships.
Let us add a year,month and date column which will help us look at trends over time.
ipf_main <- ipf_main %>%
mutate("Year" <- year(lubridate::ymd(meet_date)),
"Month" <- month(lubridate::ymd(meet_date)),
"YearMonth" <- ymd(paste(Year,"-",Month,"-","01")))
colnames(ipf_main)[16] <- "Year"
colnames(ipf_main)[17] <- "Month"
colnames(ipf_main)[18] <- "YearMonth"Let us see if participation by gender is varying over time. As there are multiple meets - we will look at the top 3 meets by participation
freq_count <- ipf_main %>%
group_by(meet_name) %>%
summarise(count = n()) %>%
arrange(desc(count))
kable(head(freq_count, n = 3))| meet_name | count |
|---|---|
| World Bench Press Championships | 7036 |
| World Masters Powerlifting Championships | 5936 |
| World Classic Powerlifting Championships | 3903 |
ipf_main %>%
filter(meet_name == "World Bench Press Championships") %>%
group_by(Year) %>%
ggplot(aes(Year, fill = factor(sex))) +
geom_bar(position = "fill") +
coord_flip() +
ggtitle("Gender variation across years",
subtitle = "World Bench Press Championships") +
labs(x = "Year", y = "Gender Split Percentage") +
theme(plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust=0.5)) +
guides(fill=guide_legend(title = "Sex"))ipf_main %>%
filter(meet_name == "World Classic Powerlifting Championships") %>%
group_by(Year) %>%
ggplot(aes(Year, fill = factor(sex))) +
geom_bar(position = "fill") +
coord_flip() +
ggtitle("Gender variation across years",
subtitle = " World Classic Powerlifting Championships") +
labs(x = "Year", y = "Gender Split Percentage") +
theme(plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust=0.4)) +
guides(fill=guide_legend(title = "Sex"))ipf_main %>%
filter(meet_name == "World Masters Bench Press Championships") %>%
group_by(Year) %>%
ggplot(aes(Year, fill = factor(sex))) +
geom_bar(position = "fill") +
coord_flip() +
ggtitle("Gender variation across years",
subtitle = " World Masters Bench Press Championships") +
labs(x = "Year", y = "Gender Split Percentage") +
theme(plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust=0.4)) +
guides(fill=guide_legend(title = "Sex"))We looked at the top 3 meets by participation and below are the findings -
World Bench Press Championships - The male participation percentages are highly fluctuating and the male participation continues to be 60% or higher for most of the years starting from 1990.
World Classic Powerlifting Championships - There is a visible increase in women participation from 2013.
World Masters Bench Press Championships - The male participation is on the higher scale from 2000, with a slight improvement being shown in the last 3 to 4 years.
Let us now plot a barplot to observe the distribution of powerlifter’s Age Classes across various meets they participated.
ggplot(data = ipf_main, aes(x = meet_name,fill = factor(age_class))) +
geom_bar(position = "fill") +
ggtitle("Distribution of Age Classes across various Meets") +
labs(x = "Meet Name", y = "Cummulative Percentage") +
theme(plot.title = element_text(hjust = 1)) +
guides(fill=guide_legend(title = "Age Class")) +
coord_flip()From the above graph, we observe that the most of the participants across various meets belong to the 18-39 yrs age group. Interestingly, we can see that in 5 of these meets, there is significant participation of powerlifters belonging to the 40-64 yrs age group.
ipf_main %>%
group_by(Year) %>%
summarize(count= n()) %>%
ggplot(aes(x = Year, y = count)) +
geom_line() +
ggtitle("Trend of Participation over time") +
labs(x = "Year", y = "Participation COunt") +
theme(plot.title = element_text(hjust = 0.5))We see a an increase in participation of powerlifting over the years, Although the rise has been slow in the initial few years, it has really picked up in the last 10 years.
We can check if this is the case with both males and females -
ipf_main %>%
filter(sex == "M") %>%
group_by(Year) %>%
summarize(count= n()) %>%
ggplot(aes(x = Year, y = count)) +
geom_line() +
ggtitle("Trend of Participation over time for males") +
labs(x = "Year", y = "Participation Count") +
theme(plot.title = element_text(hjust = 0.5))ipf_main %>%
filter(sex == "F") %>%
group_by(Year) %>%
summarize(count= n()) %>%
ggplot(aes(x = Year, y = count)) +
geom_line() +
ggtitle("Trend of Participation over time for females") +
labs(x = "Year", y = "Participation Count") +
theme(plot.title = element_text(hjust = 0.5))We see that the increase in participation for females has been higher in the recent 10 years as compared to the steady increase in the participation of males.
One other interesting thing to see is if the lifting capacity of males and females have increased over the years
ipf_main_SBD <- ipf_main_SBD %>%
mutate("Year" <- year(lubridate::ymd(meet_date)),
"Month" <- month(lubridate::ymd(meet_date)),
"YearMonth" <- ymd(paste(Year,"-",Month,"-","01")))
colnames(ipf_main_SBD)[17] <- "Year"
colnames(ipf_main_SBD)[18] <- "Month"
colnames(ipf_main_SBD)[19] <- "YearMonth"
ipf_main_SBD %>%
filter(sex == "M") %>%
group_by(Year) %>%
summarise(avg_weight = mean(totalweight)) %>%
ggplot(aes(x = Year)) +
geom_line(aes(y = avg_weight)) +
labs(x= "Year", y = "Mean Total Weight Lifted") +
ggtitle("Average weight lifted over time for males") +
theme(plot.title = element_text(hjust = 0.5))ipf_main_SBD %>%
filter(sex == "F") %>%
group_by(Year) %>%
summarise(avg_weight = mean(totalweight)) %>%
ggplot(aes(x = Year)) +
geom_line(aes(y = avg_weight)) +
labs(x= "Year", y = "Mean Total Weight Lifted") +
ggtitle("Average weight lifted over time for females") +
theme(plot.title = element_text(hjust = 0.5))These graphs show that the average weight lifted has been highly fluctuating for males in the last 40 years. However, for females - the average weight lifted has been trending upwards from 1980 till 2009 and then shows a declining trend. A thing which stands out here is that, the year 2009 is when women had shown peak performance in powerlifting over the last 40 years.
Methodology
We will subset the dataset to just include event SBD as it has weight lifts across all 3 events. The reason we do this is to use the response variable as total weight lifted and reduce skew to the maximum possible extent at the same time.
We currently have ~27k observations and we get rid of NA values which will reduce the number of observations by ~3.5K. As we are still left with ~24k observations, which are good enough to fit a regression model - we go ahead and omit the NA values.
Then, we further subset to just include variables which makes sense. The variables which we want to include which could have a possible impact on the total weight lifted are - sex, equipment, age, bodyweight and total_weight. So,we add in the variable total_weight to this subset - which is the sum of weights lifted in deadlift, bench and squat.
We will have to convert the variables sex, equipment to factors to be able to include them in the lm function.
We then choose a base model with the help of backward elimination. The scope for this procedure is between an intercept model and a model with all the variables upto the inclusion of all 2 - way interactions.
ipf_main_subset2 <- ipf_main %>%
filter(event == "SBD")
ipf_main_subset2 <- ipf_main_subset2 %>%
dplyr::select(sex, equipment, age, bodyweight, best3squat, best3deadlift, best3bench) %>%
group_by(sex, equipment, age, bodyweight, best3squat, best3deadlift, best3bench)
ipf_main_subset2 <- na.omit(ipf_main_subset2)
ipf_main_subset2 <- ipf_main_subset2 %>%
mutate(total_weight <- best3squat + best3deadlift + best3bench)
colnames(ipf_main_subset2)[8] <- "totalweight"
ipf_main_subset2 <- ipf_main_subset2 %>%
group_by(sex, equipment, age, bodyweight, totalweight) %>%
dplyr::select(sex, equipment, age, bodyweight, totalweight) %>%
ungroup()
ipf_main_subset2$sex <- as.factor(ipf_main_subset2$sex)
ipf_main_subset2$equipment <- as.factor(ipf_main_subset2$equipment)
fit_min <- lm(totalweight ~ 1, data = ipf_main_subset2)
#summary(fit_min)
fit_max <- lm(totalweight ~ .^2, data = ipf_main_subset2)
#summary(fit_max)
fit_be <- step(fit_max, direction = "backward",
trace = 0, k = log(nrow(ipf_main_subset2)))Once we performed the necessary data cleaning and have fit the model, let us define some functions to look at the model metrics such as PRESS residuals, AIC, Adj. R^2 which will help us understand the accuracy or fit of the model.
PRESS <- function(object, ...) {
if(!missing(...)) {
res <- sapply(list(object, ...), FUN = function(x) {
sum(rstandard(x, type = "predictive") ^ 2)
})
names(res) <- as.character(match.call()[-1L])
res
} else {
sum(rstandard(object, type = "predictive") ^ 2)
}
}
modelMetrics <- function(object, ...) {
if(!missing(...)) {
res <- sapply(list(object, ...), FUN = function(x) {
c("AIC" = AIC(x), "BIC" = BIC(x),
"adjR2" = summary(x)$adj.r.squared,
"RMSE" = sigma(x), "PRESS" = PRESS(x),
"nterms" = length(coef(x)))
})
colnames(res) <- as.character(match.call()[-1L])
res
} else {
c("AIC" = AIC(object), "BIC" = BIC(object),
"adjR2" = summary(object)$adj.r.squared,
"RMSE" = sigma(object), "PRESS" = PRESS(object),
"nterms" = length(coef(object)))
}
}
kable(round(modelMetrics(fit_be),3))| x | |
|---|---|
| AIC | 2.805656e+05 |
| BIC | 2.806625e+05 |
| adjR2 | 7.500000e-01 |
| RMSE | 9.125500e+01 |
| PRESS | 1.970247e+08 |
| nterms | 1.100000e+01 |
##
## Call:
## lm(formula = totalweight ~ sex + equipment + age + bodyweight +
## sex:age + sex:bodyweight + equipment:bodyweight + age:bodyweight,
## data = ipf_main_subset2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -619.01 -55.25 -0.49 57.85 332.56
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 153.085233 7.305437 20.955 < 2e-16 ***
## sexM 144.220603 6.011917 23.989 < 2e-16 ***
## equipmentSingle-ply 26.555316 5.149745 5.157 2.53e-07 ***
## equipmentWraps -37.376752 43.460597 -0.860 0.389790
## age 0.526839 0.157522 3.345 0.000825 ***
## bodyweight 3.552207 0.098484 36.069 < 2e-16 ***
## sexM:age -1.478453 0.102851 -14.375 < 2e-16 ***
## sexM:bodyweight 1.196799 0.067436 17.747 < 2e-16 ***
## equipmentSingle-ply:bodyweight 0.541354 0.062790 8.622 < 2e-16 ***
## equipmentWraps:bodyweight 1.472708 0.553301 2.662 0.007781 **
## age:bodyweight -0.024062 0.002153 -11.176 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 91.26 on 23634 degrees of freedom
## Multiple R-squared: 0.7506, Adjusted R-squared: 0.7505
## F-statistic: 7112 on 10 and 23634 DF, p-value: < 2.2e-16
## [1] 91.25525
The RMSE is slightly high at ~91 and adj. R^2 is around 0.75 whuch tells that ~75% of the variability of the response being explained with these predictors. Let us look at the diagnostics to understand the problem better before we take a call on the model.
Let us begin by looking at the fitted vs studentized residual plot and the QQ plots to check for problems such as non-constant variance and non-normality.
ipf_subset_stats <- ipf_main_subset2 %>%
lm(totalweight ~ sex + equipment + age + bodyweight +
sex:age + sex:bodyweight + equipment:bodyweight + age:bodyweight, data = .) %>%
broom::augment() %>%
mutate(row_num = 1:n())
# Fitted vs Residuals
ggplot(ipf_subset_stats, aes(x = .fitted, y = .std.resid)) +
geom_point(alpha = 0.1) +
geom_hline(yintercept = 0, linetype = "dashed", color = "red2") +
geom_hline(yintercept = c(-2, 2), linetype = "dotted") +
# geom_smooth(color = "forestgreen", alpha = 0.1, se = FALSE) +
xlab("Fitted value") +
ylab("Studentized residual") +
theme_light() +
ggtitle("Fitted values vs residuals") +
theme(plot.title = element_text(hjust = 0.5))# Normality check for residuals
ggplot(ipf_subset_stats, aes(sample = .std.resid)) +
geom_qq(alpha = 0.1) +
geom_qq_line(linetype = "dashed", color = "red2") +
xlab("Theoretical quantile") +
ylab("Sample quantile") +
theme_light() +
ggtitle("Normality check for residuals") +
theme(plot.title = element_text(hjust = 0.5))# serial correlation check for residuals
ggplot(ipf_subset_stats, aes(x = row_num, y = .std.resid)) +
geom_point(alpha = 0.3) +
geom_line() +
geom_hline(yintercept = 0, linetype = "dashed", color = "red2") +
xlab("Index") +
ylab("Stundentized residual") +
theme_light() +
ggtitle("Serial correlation check for residuals") +
theme(plot.title = element_text(hjust = 0.5))We see that this fit has a slight problem with non-constant variance and QQ plot shows that it is not perfectly normal and has some skewness towards the left. We do not see much of an issue with serial correlation although there is a time aspect to this data, which is a good thing. We have achieved a decent fit of the model but let us take another shot to fix the non - constant variance to some extent .
We will apply box-cox transformation on the response variable to fix this and then look at the residual plot and model metrics once again.
bc <- MASS::boxcox(totalweight ~ sex + equipment + age + bodyweight +
sex:age + sex:bodyweight + equipment:bodyweight + age:bodyweight, data = ipf_main_subset2)## [1] 0.5858586
ipf_main_subset2$totalweight <- ((ipf_main_subset2$totalweight ^ lambda) - 1) / lambda
ipf_subset_transformed_stats <- ipf_main_subset2 %>%
lm(totalweight ~ sex + equipment + age + bodyweight +
sex:age + sex:bodyweight + equipment:bodyweight + age:bodyweight, data = .) %>%
broom::augment() %>%
mutate(row_num = 1:n())
ggplot(ipf_subset_transformed_stats, aes(x = .fitted, y = .std.resid)) +
geom_point(alpha = 0.1) +
geom_hline(yintercept = 0, linetype = "dashed", color = "red2") +
geom_hline(yintercept = c(-2, 2), linetype = "dotted") +
# geom_smooth(color = "forestgreen", alpha = 0.1, se = FALSE) +
xlab("Fitted value") +
ylab("Studentized residual") +
theme_light() +
ggtitle("Fitted values vs residuals") +
theme(plot.title = element_text(hjust = 0.5))fit_step_transformed <- lm(totalweight ~ sex + equipment + age + bodyweight +
sex:age + sex:bodyweight + equipment:bodyweight + age:bodyweight, data = ipf_main_subset2)
summary(fit_step_transformed) ##
## Call:
## lm(formula = totalweight ~ sex + equipment + age + bodyweight +
## sex:age + sex:bodyweight + equipment:bodyweight + age:bodyweight,
## data = ipf_main_subset2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -46.839 -4.055 0.212 4.395 20.765
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 35.2632718 0.5321196 66.269 < 2e-16 ***
## sexM 14.1796247 0.4379012 32.381 < 2e-16 ***
## equipmentSingle-ply 3.4709687 0.3751015 9.253 < 2e-16 ***
## equipmentWraps -1.3096680 3.1656201 -0.414 0.6791
## age -0.0065149 0.0114737 -0.568 0.5702
## bodyweight 0.2890032 0.0071735 40.288 < 2e-16 ***
## sexM:age -0.0928014 0.0074915 -12.388 < 2e-16 ***
## sexM:bodyweight 0.0371200 0.0049120 7.557 4.27e-14 ***
## equipmentSingle-ply:bodyweight 0.0206552 0.0045735 4.516 6.32e-06 ***
## equipmentWraps:bodyweight 0.0892475 0.0403018 2.214 0.0268 *
## age:bodyweight -0.0013089 0.0001568 -8.346 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.647 on 23634 degrees of freedom
## Multiple R-squared: 0.753, Adjusted R-squared: 0.7529
## F-statistic: 7205 on 10 and 23634 DF, p-value: < 2.2e-16
## [1] 6.646928
| x | |
|---|---|
| AIC | 156689.179 |
| BIC | 156786.030 |
| adjR2 | 0.753 |
| RMSE | 6.647 |
| PRESS | 1045334.301 |
| nterms | 11.000 |
| GVIF | Df | GVIF^(1/(2*Df)) | |
|---|---|---|---|
| sex | 23.49840 | 1 | 4.847514 |
| equipment | 181.03323 | 2 | 3.668087 |
| age | 14.06866 | 1 | 3.750822 |
| bodyweight | 14.53333 | 1 | 3.812260 |
| sex:age | 12.12624 | 1 | 3.482275 |
| sex:bodyweight | 26.22333 | 1 | 5.120872 |
| equipment:bodyweight | 231.69624 | 2 | 3.901483 |
| age:bodyweight | 26.80871 | 1 | 5.177713 |
We see that the box-cox transformation has fixed the violation of non-constant variance to an extent. A good outcome of this transformation is that, we managed to decrease the BIC by 44% and and RMSE by ~92% (New RMSE: 6.65). In addition, the VIF are seems to be in range and thus, we can conclude that this is indeed a good fit for the model for this phase of analysis.
Final Model
The model which explains the change in the total weight lifted by the power lifter is -
\[totalweight: sex + equipment + age + bodyweight + sex:age + sex:bodyweight + equipment:bodyweight + age:bodyweight\]
The parameters for each of these predictor variables is shown below -
Model_parameters <- read.csv("modelparameters.csv",stringsAsFactors = FALSE)
Model_parameters$Estimate <- round(Model_parameters$Estimate,2)
Model_parameters$Std_Error <- round(Model_parameters$Std_Error,2)
kable(Model_parameters)| X | Estimate | Std_Error |
|---|---|---|
| (Intercept) | 35.26 | 0.53 |
| sexM | 14.18 | 0.44 |
| equipmentSingle-ply | 3.47 | 0.38 |
| equipmentWraps | -1.31 | 3.17 |
| age | -0.01 | 0.01 |
| bodyweight | 0.29 | 0.01 |
| sexM:age | -0.09 | 0.01 |
| sexM:bodyweight | 0.04 | 0.00 |
| equipmentSingle-ply:bodyweight | 0.02 | 0.00 |
| equipmentWraps:bodyweight | 0.09 | 0.04 |
| age:bodyweight | 0.00 | 0.00 |
According to the coefficients, we see that the parameters which have the most impact are sex and equipment, which is in line with our prior EDA.
Note - The next phase of the project will solely focus on improving the predective capability of the model.
PROBLEM STATEMENT:
The main focus of our analysis is to help both the informed and uninformed readers understand the impact of key factors such as age, sex, bodyweight and equipment usage on the weight lifted across different kinds of powerlifts.
SOLUTION METHODOLOGY:
For better understanding of the impact of these variables, we leveraged graphics to observe the trends of total weight lifted by male and female weight lifters belonging to different age classes . We used categorical variables such as equipment , sex, age class and meet names, to add in context to the analysis and aid in understanding.
We also performed hypothesis testing to statistically prove few observations we saw from the visuals. Finally, we concluded our analysis by fitting a regression model to be able to quantify the relationship of various factors and predict the total amount of weight that can be lifted when provided with input variables such as sex, age, bodyweight and equipment.
FINDINGS & INSIGHTS:
IMPLICATIONS:
This report shows how powerlifting as a sport is gaining traction. It analyzes the historic data to understand what is the best age to achieve peak performance in powerlifting and proves that people over 50 years of age can also lift considerable weights. The reader also understands the section of people who are more likely to get disqualified and that age/ bodyweight do not have any impact on the position of the person in the meet.
The reader is also equipped with a mathematic eqantion to predict the total weight that can be lifted by a powerlifter.
LIMITATIONS:
This report contains a detailed analysis on the factors impacting the performance of powerlifters and how powerlifting, as a sport has evolved over the years. However, this analysis was conducted on a smaller dataset and not on the full dataset. The additional data might give us a little more insight into the analysis. (There is an equally likely chance that it might not, but definitely worth a try).
The modeling part can be improved upon further (This will be done in the next phase of the project).